ENH: add masked algorithm for mean() function #34814

Akshatt · 2020-06-16T02:57:34Z

closes ENH: add masked algorithm for mean() #34754
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

pep8speaks · 2020-06-16T02:57:38Z

Hello @Akshatt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-28 13:40:51 UTC

jorisvandenbossche · 2020-06-16T06:15:05Z

See here:

pandas/pandas/core/arrays/masked.py

Lines 337 to 339 in 4628665

    
    if name in {"sum", "prod", "min", "max"}:
        op = getattr(masked_reductions, name)
        return op(data, mask, skipna=skipna, **kwargs)

You can add "mean" there
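
A minimal sketch of that dispatch, assuming nothing beyond the snippet quoted above. The `masked_reductions` namespace, `_sum`, `_mean` and `masked_reduce` below are hypothetical stand-ins written with plain lists, not the pandas implementation; `skipna` and empty-input handling are deliberately left out.

```python
from types import SimpleNamespace


def _valid(values, mask):
    """Keep only the entries whose mask flag is False (i.e. not missing)."""
    return [v for v, m in zip(values, mask) if not m]


def _sum(values, mask, skipna=True):
    # skipna is accepted for signature compatibility but ignored in this sketch.
    return sum(_valid(values, mask))


def _mean(values, mask, skipna=True):
    valid = _valid(values, mask)
    return sum(valid) / len(valid)


# Hypothetical stand-in for the masked_reductions module.
masked_reductions = SimpleNamespace(sum=_sum, mean=_mean)


def masked_reduce(name, data, mask, skipna=True, **kwargs):
    # "mean" joins the names that are routed to a masked implementation.
    if name in {"sum", "mean"}:
        op = getattr(masked_reductions, name)
        return op(data, mask, skipna=skipna, **kwargs)
    # The fallback path for other reductions is omitted here.
    raise NotImplementedError(f"no masked implementation for {name!r}")
```

The point of routing by name is that the masked function works on the raw values plus the boolean mask directly, without first converting the data to a float array with NaN.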

Akshatt · 2020-06-16T13:16:46Z

@jorisvandenbossche,
The new mean function takes significantly less time.

Should tests be added, and if so, where? The same question goes for the whatsnew entry.

jreback

pls add tests (we likely already have some setup for sum, you can just add onto those)

jreback · 2020-06-16T21:47:15Z

can you add some asv's for the existing and this new case (perf metrics)

Akshatt · 2020-06-17T11:30:24Z

@jreback

pls add tests (we likely already have some setup for sum, you can just add onto those)

Alright, so I need to add tests in

test_reductions
test_dtypes
Correct me if I'm wrong or if I need to add more tests.

can you add some asv's for the existing and this new case (perf metrics)

Where exactly do I add the asv's in?

jorisvandenbossche · 2020-06-17T11:45:17Z

This operation is already covered by

pandas/asv_bench/benchmarks/series_methods.py

Lines 250 to 280 in 107ad15

    
    class NanOps:
        params = [
            [
                "var",
                "mean",
                "median",
                "max",
                "min",
                "sum",
                "std",
                "sem",
                "argmax",
                "skew",
                "kurt",
                "prod",
            ],
            [10 ** 3, 10 ** 6],
            ["int8", "int32", "int64", "float64", "Int64", "boolean"],
        ]
        param_names = ["func", "N", "dtype"]

        def setup(self, func, N, dtype):
            if func == "argmax" and dtype in {"Int64", "boolean"}:
                # Skip argmax for nullable int since this doesn't work yet (GH-24382)
                raise NotImplementedError
            self.s = Series([1] * N, dtype=dtype)
            self.func = getattr(self.s, func)

        def time_func(self, func, N, dtype):
            self.func()

(I added Int64 in general for all ops, so including mean, when I added a masked sum/prod algorithm)

So no need to write additional benchmarks I think.

For tests, it would be good to add it here:

pandas/pandas/tests/reductions/test_reductions.py

Line 683 in 5fdd6f5

def test_ops_consistency_on_empty(self, method):

(we mainly need to check the behaviour on empty)

pandas/core/array_algos/masked_reductions.py

jorisvandenbossche · 2020-06-17T11:53:17Z

We are also going to need to decide what to return for an empty array/Series. Right now we return np.nan, but this can also be pd.NA. This touches a bit the discussion we had about NaN vs NA for the floating dtype, but I suppose for now we probably want to go with pd.NA (that's what we do now).

Akshatt · 2020-06-17T13:12:44Z

@jorisvandenbossche,

We are also going to need to decide what to return for an empty array/Series. Right now we return np.nan, but this can also be pd.NA. This touches a bit the discussion we had about NaN vs NA for the floating dtype, but I suppose for now we probably want to go with pd.NA (that's what we do now).

This would involve checking if the mean is np.nan and then returning pd.NA instead in the mean function itself, yes?

dsaxton · 2020-09-15T16:49:28Z

This would involve checking if the mean is np.nan and then returning pd.NA instead in the mean function itself, yes?

@Akshatt You could also do a length check before any calculation as with other functions here, e.g. something like not values.size

Akshatt · 2020-09-15T17:33:05Z

This would involve checking if the mean is np.nan and then returning pd.NA instead in the mean function itself, yes?

@Akshatt You could also do a length check before any calculation as with other functions here, e.g. something like not values.size

Hi @dsaxton, thanks for replying!

if not skipna:
    if mask.any() or not values.size:
        return libmissing.NA

This would suffice, right?

dsaxton · 2020-09-15T17:47:24Z

@Akshatt Since we're always returning NA on empty input regardless of skipna, you could also do:

if not values.size:
    return pd.NA

# logic for non-empty input
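
The ordering suggested here can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: it uses plain lists instead of NumPy arrays, `NA` is a hypothetical stand-in for `pd.NA`, and returning `NA` when every entry is masked is an assumption about the intended behaviour.

```python
class _NAType:
    """Hypothetical stand-in for pd.NA: a singleton marking a missing result."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "<NA>"


NA = _NAType()


def masked_mean(values, mask, skipna=True):
    # Empty input yields NA regardless of skipna, so check it first.
    if not len(values):
        return NA
    # With skipna=False, a single missing entry makes the result missing.
    if not skipna and any(mask):
        return NA
    valid = [v for v, m in zip(values, mask) if not m]
    # Every entry is masked: there is nothing to average (assumed behaviour).
    if not valid:
        return NA
    return sum(valid) / len(valid)
```

Checking for empty input up front keeps the `skipna` branch focused on the mask alone, which is why the combined `mask.any() or not values.size` condition is no longer needed.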

pandas/core/array_algos/masked_reductions.py

dsaxton · 2020-09-15T18:41:49Z

pandas/tests/reductions/test_reductions.py

@@ -688,6 +688,12 @@ def test_ops_consistency_on_empty(self, method):
        result = getattr(Series(dtype=float), method)()
        assert pd.isna(result)

+        # Empty Mean
+        if method == "mean":


I think this is already being tested directly above? Probably want tests for non-empty input also (i.e., non-empty with some missing, non-empty with no missing, non-empty with all missing, parametrizing over skipna).
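
The case grid described here could be laid out roughly as below. In the pandas test suite this would more naturally be a `pytest.mark.parametrize` over `skipna`; the sketch uses only the standard library, and `MISSING`, `CASES`, `reference_mean` and `GRID` are hypothetical names. The expected results encode an assumption about the intended semantics (any missing value with `skipna=False`, or no valid values at all, gives a missing result).

```python
from itertools import product
from statistics import fmean

MISSING = None  # hypothetical marker for a missing entry in this sketch

CASES = {
    "some_missing": [1, MISSING, 3],
    "no_missing": [1, 2, 3],
    "all_missing": [MISSING, MISSING],
}


def reference_mean(data, skipna):
    """Expected result for one case; MISSING plays the role of pd.NA."""
    present = [x for x in data if x is not MISSING]
    has_missing = len(present) != len(data)
    if not present or (has_missing and not skipna):
        return MISSING
    return fmean(present)


# One entry per (case, skipna) combination, as a parametrized test would see it.
GRID = {
    (name, skipna): reference_mean(data, skipna)
    for (name, data), skipna in product(CASES.items(), [True, False])
}
```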

Yes redundant now. Previously np.nan was being returned for empty values.

Should I add those tests here?

These cases would I hope already be tested for Series but maybe doesn't hurt to add some more tests under /pandas/tests/series/test_arithmetic.py (they could always be deduplicated later if already tested) since I believe masked arrays boxed inside Series hit this path. You could test means of both nullable integer and boolean dtypes.

As a follow-up enhancement would also be nice to use this to directly implement mean on masked arrays:

    [ins] In [1]: import pandas as pd

    [ins] In [2]: arr = pd.array([1, 2, 3])

    [ins] In [3]: arr.sum()
    Out[3]: 6

    [ins] In [4]: arr.mean()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-4-057c7202e3e0> in <module>
    ----> 1 arr.mean()

    AttributeError: 'IntegerArray' object has no attribute 'mean'

These cases would I hope already be tested for Series but maybe doesn't hurt to add some more tests under /pandas/tests/series/test_arithmetic.py (they could always be deduplicated later if already tested) since I believe masked arrays boxed inside Series hit this path. You could test means of both nullable integer and boolean dtypes.

I'm not sure exactly where to add the tests. In /pandas/tests/series/test_arithmetic.py, should I add them in the Unsorted section for now?
Also, I would be adding tests for the nullable integer and nullable boolean dtypes, yes?
And how do I check if the tests are passing?

As a follow-up enhancement would also be nice to use this to directly implement mean on masked arrays:

    [ins] In [1]: import pandas as pd

    [ins] In [2]: arr = pd.array([1, 2, 3])

    [ins] In [3]: arr.sum()
    Out[3]: 6

    [ins] In [4]: arr.mean()
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-4-057c7202e3e0> in <module>
    ----> 1 arr.mean()

    AttributeError: 'IntegerArray' object has no attribute 'mean'

yes that makes sense, although I'm not sure how to go about doing this.

Also probably good to merge master since it's been a while

@dsaxton
I have added these tests under

pandas/pandas/tests/reductions/test_reductions.py

Line 683 in 5fdd6f5

def test_ops_consistency_on_empty(self, method):

Any quick fixes? Also, how do I ensure these tests are passing?

Hi @dsaxton,
pushed to branch named tests. Please have a look

@Akshatt please push it to this branch (the one this PR is based on, i.e. your master branch), then we can see the tests here in the PR

Yes, sorry @Akshatt if that was confusing. I meant in the future it's good to avoid modifying your master branch, not to work on a separate branch for this PR.

Regarding your test you'll also want to explicitly assert result is pd.NA which (oddly enough) is slightly different from pd.isna(result).
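
The difference between the two assertions can be illustrated without pandas. In the sketch below, `NA` is a hypothetical stand-in for the `pd.NA` singleton and `isna` is a simplified scalar analogue of `pd.isna`; in pandas itself, `pd.isna(np.nan)` and `pd.isna(pd.NA)` are both true, while `np.nan is pd.NA` is false.

```python
import math

NA = object()  # hypothetical stand-in for the pd.NA singleton


def isna(value):
    """Simplified scalar analogue of pd.isna: true for NA and for float NaN."""
    return value is NA or (isinstance(value, float) and math.isnan(value))


nan_result = float("nan")  # e.g. a NaN produced by a float-based code path
na_result = NA             # what the masked mean is meant to return

# Both results satisfy the looser isna-style check ...
both_missing = isna(nan_result) and isna(na_result)
# ... but only one of them is the NA singleton itself.
identity = (nan_result is NA, na_result is NA)
```

An `isna`-style assertion would therefore keep passing if the function regressed to returning NaN, whereas the identity check would catch it.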

pandas/core/array_algos/masked_reductions.py

github-actions · 2020-10-25T00:17:45Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

dsaxton · 2020-10-25T02:04:36Z

@Akshatt Is this still active? Can you address previous comments if so?

jreback · 2020-12-23T15:15:49Z

adding min_count ?

As I already answered, min_count is not applicable here. So I think it is only the whatsnew note that remains code-wise

this is fine (you are repeating an old comment which i agree is not relevant).

the asv's and whatsnew are still needed.

jorisvandenbossche · 2020-12-23T15:17:05Z

you are repeating an old comment which i agree is not relevant

Jeff, I am answering @Akshatt because he asked exactly that 2 hours ago

jorisvandenbossche · 2020-12-23T15:33:24Z

Since setting up ASV is not the easiest thing to do, I quickly ran the aforementioned benchmark that covers this case (asv run --python=same -b "series_methods\.NanOps\.time_func\('mean'" on master vs this branch):

master:

[100.00%] ··· series_methods.NanOps.time_func              ok
[100.00%] ··· ======== ========= ============ ============= ============= ============= ============ ============
              --                                                      dtype                                      
              ------------------ --------------------------------------------------------------------------------
                func       N         int8         int32         int64        float64       Int64       boolean   
              ======== ========= ============ ============= ============= ============= ============ ============
                mean      1000    36.3±0.3μs    21.2±0.1μs    21.0±0.1μs   21.0±0.06μs   40.7±0.2μs   46.1±0.2μs 
                mean    1000000    578±3μs     1.06±0.04ms   1.14±0.04ms    1.26±0.2ms   1.87±0.2ms   1.63±0.2ms 
              ======== ========= ============ ============= ============= ============= ============ ============

this branch:

[100.00%] ··· ======== ========= =========== ============ ============ ============ ============= =============
              --                                                     dtype                                     
              ------------------ ------------------------------------------------------------------------------
                func       N         int8       int32        int64       float64        Int64        boolean   
              ======== ========= =========== ============ ============ ============ ============= =============
                mean      1000     47.0±5μs   23.2±0.9μs    25.1±3μs    22.5±0.7μs     15.6±1μs      19.3±1μs  
                mean    1000000   779±200μs   1.23±0.3ms   1.36±0.3ms   1.46±0.4ms   1.45±0.08ms   1.38±0.03ms 
              ======== ========= =========== ============ ============ ============ ============= =============

So you see a modest speedup for the Int64 / boolean case (although apparently for larger data, the relative benefit diminishes)
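
To put numbers on that, the relative speedups can be computed from the mean timings in the two tables above. The values are copied from the tables (converted to microseconds); the dictionary names are just for this illustration, and the reported ± spreads are ignored.

```python
# Mean timings in microseconds, copied from the two asv tables.
master = {
    ("Int64", 1_000): 40.7,
    ("boolean", 1_000): 46.1,
    ("Int64", 1_000_000): 1870.0,
    ("boolean", 1_000_000): 1630.0,
}
branch = {
    ("Int64", 1_000): 15.6,
    ("boolean", 1_000): 19.3,
    ("Int64", 1_000_000): 1450.0,
    ("boolean", 1_000_000): 1380.0,
}

# Ratio > 1 means this branch is faster than master.
speedup = {key: master[key] / branch[key] for key in master}

for (dtype, n), ratio in speedup.items():
    print(f"{dtype:>8} N={n:<8} {ratio:.2f}x")
```

This gives roughly 2.6x (Int64) and 2.4x (boolean) at N=1000, dropping to about 1.3x and 1.2x at N=1000000.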

jreback · 2020-12-23T15:47:04Z

you are repeating an old comment which i agree is not relevant

Jeff, I am answering @Akshatt because he asked exactly that 2 hours ago

as did i but ok

Akshatt · 2020-12-24T11:47:46Z

Hi, sorry to have caused all the misunderstanding.
@jorisvandenbossche, thanks for running the asv benchmark as I have no experience in doing it.
So, all that's left is to add the entry in the What's new in v1.3.0.rst under Performance improvements like so
" Performance improvement in :meth:`array.mean` (:issue:`34814`) ". Is this okay? Should I attach the asv results as well?

jorisvandenbossche · 2020-12-24T12:10:09Z

So, all that's left is to add the entry in the What's new in v1.3.0.rst under Performance improvements like so
" Performance improvement in :meth:`array.mean` (:issue:`34814`) ".

Yes, that's all.

doc/source/whatsnew/v1.3.0.rst

jreback · 2021-01-01T21:46:52Z

thanks @Akshatt

Akshatt mentioned this pull request Jun 16, 2020

ENH: add masked algorithm for mean() #34754

Closed

Akshatt marked this pull request as ready for review June 16, 2020 08:48

jreback requested changes Jun 16, 2020

jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jun 16, 2020

jorisvandenbossche reviewed Jun 17, 2020

pandas/core/array_algos/masked_reductions.py Outdated

pandas/core/array_algos/masked_reductions.py Outdated

Akshatt force-pushed the master branch from 07706a5 to aea7eff Compare June 17, 2020 12:34

Akshatt force-pushed the master branch from aea7eff to 0c4e7bb Compare June 26, 2020 04:24

Akshatt requested review from jreback and jorisvandenbossche July 13, 2020 11:46

simonjayhawkins mentioned this pull request Sep 15, 2020

CI: Add stale PR action #36336

Merged

simonjayhawkins added the Needs Review label Sep 15, 2020

dsaxton reviewed Sep 15, 2020

pandas/core/array_algos/masked_reductions.py

dsaxton reviewed Sep 15, 2020

pandas/core/array_algos/masked_reductions.py Outdated

dsaxton removed the Needs Review label Sep 15, 2020

dsaxton mentioned this pull request Sep 23, 2020

BUG: DataFrame.mean(axis=1) fails but Series.mean works incase of pandas Nullable integer dtype #36585

Open

3 tasks

github-actions bot added the Stale label Oct 25, 2020

Akshatt force-pushed the master branch from 6f2e25c to 5798070 Compare December 24, 2020 19:12

jreback requested changes Dec 24, 2020

doc/source/whatsnew/v1.3.0.rst Outdated

Akshatt added 15 commits December 26, 2020 18:03

ENH: add masked mean function

eb5a40d

Indentation

8788b15

ENH: masked mean functioning

30ea646

Blankline error correction

98f013a

Separate lines in calculation

e017657

modified

9f16d27

added empty values check

123838d

fixed linting error and removed redundant test

95ae20c

tests on empty and nan for masked series

9948bbe

modified assert statement

a9b4287

added parameterized test for empty/all-na series mean

13b985a

removed objects dtype

b036b3b

float64 to Float64 renamed, func renamed

ebf8803

Added entry in What's new v1.3.0

f9b17f7

changed array.mean to Series.mean with ExtensionDtype columns

c02d6ac

Akshatt force-pushed the master branch from 5798070 to c02d6ac Compare December 26, 2020 12:40

jorisvandenbossche reviewed Dec 28, 2020

doc/source/whatsnew/v1.3.0.rst Outdated

Update doc/source/whatsnew/v1.3.0.rst

f8e80a1

jreback approved these changes Jan 1, 2021

jreback merged commit 2a60c56 into pandas-dev:master Jan 1, 2021

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

ENH: add masked algorithm for mean() function (pandas-dev#34814)

c4e9572
